{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\n    @Author: King\\n    @Date: 2019.05.16\\n    @Purpose: EDA：最简单的自然语言处理数据增广方法\\n    @Introduction:  EDA：最简单的自然语言处理数据增广方法\\n    @Datasets: Chinese relation extration datasets\\n    @Link : https://mp.weixin.qq.com/s?__biz=MzAxMTU5Njg4NQ==&mid=100001705&idx=5&sn=eaa4b3a5a911f7f4068eff98ab7c7867\\n    @Reference : https://github.com/jasonwei20/eda_nlp\\n    @paper ： \\n'"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# encoding = utf8\n",
    "'''\n",
    "    @Author: King\n",
    "    @Date: 2019.05.16\n",
    "    @Purpose: EDA：最简单的自然语言处理数据增广方法\n",
    "    @Introduction:  EDA：最简单的自然语言处理数据增广方法\n",
    "    @Datasets: Chinese relation extration datasets\n",
    "    @Link : https://mp.weixin.qq.com/s?__biz=MzAxMTU5Njg4NQ==&mid=100001705&idx=5&sn=eaa4b3a5a911f7f4068eff98ab7c7867\n",
    "    @Reference : https://github.com/jasonwei20/eda_nlp\n",
    "    @paper ： \n",
    "'''"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## EDA：最简单的自然语言处理数据增广方法\n",
    "\n",
    "我向你介绍EDA：简单数据增广技术，可以大大提升文本分类任务的性能（在EDA Github repository有简单的实现代码）。EDA包含四个简单操作，能极好地防止过拟合，并训练出更强健的模型，分别是：\n",
    "\n",
    "1. 同义词替换：在句子中随机选取n个非停用词。对每个选取的词，用它的随机选取的同义词替换。\n",
    "\n",
    "2. 随机插入：在句子中任意找一个非停用词，随机选一个它的同义词，插入句子中的任意位置。重复n次。\n",
    "\n",
    "3. 随机交换：任意选取句子中的两个词，交换位置。重复n次。\n",
    "\n",
    "4. 随机删除：对于句子中概率为p的每一个词，随机删除。\n",
    "\n",
    "这些技术真有效吗？出乎意料，答案是肯定的。尽管生成的某些句子有点怪异，但是在数据集中的引入一些噪声，对于训练出一个健壮的模型来说，是极有好处的，特别是数据集比较小的时候。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 一、随机采样"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1、引入包"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd \n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 功能：内容信息查询方法\n",
    "def describe_df(df):\n",
    "    '''\n",
    "\t功能：df 内容信息查询方法\n",
    "\t:param x:    需要查询信息的变量df\n",
    "    :return:  \n",
    "\t'''\n",
    "    #print(\"Type: {0}\".format(type(df)))\n",
    "    print(\"Shape/size: {0}\".format(len(df)))\n",
    "    print(\"df.iloc[0]:{0}\".format(df.iloc[0])) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2、数据加载"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Shape/size: 287351\n",
      "df.iloc[0]:e1                           金泰熙\n",
      "e2                            金东\n",
      "r                            NaN\n",
      "s     韩国梦想演唱会第十届2004年:MC:金泰熙，金东万\n",
      "Name: 0, dtype: object\n"
     ]
    }
   ],
   "source": [
    "'''\n",
    "    quoting : int or csv.QUOTE_* instance, default 0\n",
    "        控制csv中的引号常量。可选 QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3)\n",
    "    参考地址：\n",
    "        https://www.cnblogs.com/datablog/p/6127000.html\n",
    "        https://blog.csdn.net/longwei92/article/details/84326138\n",
    "'''\n",
    "type = 'train'\n",
    "base_path = 'E:/pythonWp/game/CCKS2019/RelationshipExtraction/origin_data/'\n",
    "df = pd.read_csv(\"{0}{1}\".format(base_path,\"train_char.txt\"),quoting = 3,sep='\\t',names=['e1','e2','r','s'])\n",
    "describe_df(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3、将 nan 替换为 NA 字符 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Shape/size: 287351\n",
      "df.iloc[0]:e1                           金泰熙\n",
      "e2                            金东\n",
      "r                             NA\n",
      "s     韩国梦想演唱会第十届2004年:MC:金泰熙，金东万\n",
      "Name: 0, dtype: object\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "NA                               248850\n",
       "人物关系/亲属关系/配偶/丈夫/现夫                 8135\n",
       "人物关系/亲属关系/血亲/自然血亲/父母/生父            6870\n",
       "人物关系/亲属关系/配偶/妻子/现妻                 5513\n",
       "人物关系/师生关系/老师                       2900\n",
       "人物关系/亲属关系/血亲/自然血亲/子女/儿子            2627\n",
       "人物关系/亲属关系/血亲/自然血亲/兄弟姐妹/哥哥          1673\n",
       "人物关系/社交关系/友谊关系/朋友                  1610\n",
       "人物关系/亲属关系/血亲/自然血亲/父母/生母            1383\n",
       "人物关系/社交关系/感情关系/喜欢                  1301\n",
       "人物关系/社交关系/感情关系/恋人                  1266\n",
       "人物关系/亲属关系/血亲/自然血亲/子女/女儿             830\n",
       "人物关系/亲属关系/血亲/自然血亲/兄弟姐妹/妹妹           805\n",
       "人物关系/亲属关系/血亲/自然血亲/兄弟姐妹/弟弟           637\n",
       "人物关系/师生关系/学生                        547\n",
       "人物关系/亲属关系/血亲/自然血亲/兄弟姐妹/姐姐           532\n",
       "人物关系/亲属关系/血亲/自然血亲/祖父母/爷爷            291\n",
       "人物关系/亲属关系/配偶/妻子/前妻                  245\n",
       "人物关系/亲属关系/配偶/丈夫/前夫                  218\n",
       "人物关系/亲属关系/配偶/丈夫/未婚夫                 183\n",
       "人物关系/亲属关系/姻亲/配偶的血亲/配偶的父母/岳父         165\n",
       "人物关系/亲属关系/血亲/自然血亲/侄甥/侄子             158\n",
       "人物关系/亲属关系/姻亲/血亲的配偶/子女的配偶/女婿         119\n",
       "人物关系/亲属关系/血亲/自然血亲/叔伯姑舅姨/舅舅           77\n",
       "人物关系/亲属关系/血亲/自然血亲/叔伯姑舅姨/叔伯           77\n",
       "人物关系/亲属关系/配偶/妻子/未婚妻                  69\n",
       "人物关系/亲属关系/姻亲/血亲的配偶/兄弟姐妹的配偶/嫂子        67\n",
       "人物关系/亲属关系/血亲/自然血亲/孙子女/孙子             46\n",
       "人物关系/亲属关系/血亲/自然血亲/祖父母/奶奶             40\n",
       "人物关系/亲属关系/血亲/自然血亲/侄甥/侄女              30\n",
       "人物关系/亲属关系/姻亲/配偶的血亲/配偶的父母/公公          24\n",
       "人物关系/亲属关系/血亲/自然血亲/叔伯姑舅姨/姑妈           22\n",
       "人物关系/亲属关系/血亲/自然血亲/孙子女/孙女             19\n",
       "人物关系/亲属关系/姻亲/血亲的配偶/子女的配偶/儿媳          13\n",
       "人物关系/亲属关系/血亲/自然血亲/外祖父母/外公             9\n",
       "Name: r, dtype: int64"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 将 nan 替换为 NA 字符 \n",
    "df_all = df.fillna(\"NA\")\n",
    "describe_df(df_all)\n",
    "# pandas统计个数 参考：https://blog.csdn.net/waple_0820/article/details/80514073\n",
    "df_all['r'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4、清除 NA 关系"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Shape/size: 38501\n",
      "df.iloc[0]:e1                          窦滔\n",
      "e2                          苏蕙\n",
      "r           人物关系/亲属关系/配偶/妻子/现妻\n",
      "s     苏蕙思念不已，织绵为《回文璇玑图诗》，寄赠窦滔。\n",
      "Name: 248850, dtype: object\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "人物关系/亲属关系/配偶/丈夫/现夫               8135\n",
       "人物关系/亲属关系/血亲/自然血亲/父母/生父          6870\n",
       "人物关系/亲属关系/配偶/妻子/现妻               5513\n",
       "人物关系/师生关系/老师                     2900\n",
       "人物关系/亲属关系/血亲/自然血亲/子女/儿子          2627\n",
       "人物关系/亲属关系/血亲/自然血亲/兄弟姐妹/哥哥        1673\n",
       "人物关系/社交关系/友谊关系/朋友                1610\n",
       "人物关系/亲属关系/血亲/自然血亲/父母/生母          1383\n",
       "人物关系/社交关系/感情关系/喜欢                1301\n",
       "人物关系/社交关系/感情关系/恋人                1266\n",
       "人物关系/亲属关系/血亲/自然血亲/子女/女儿           830\n",
       "人物关系/亲属关系/血亲/自然血亲/兄弟姐妹/妹妹         805\n",
       "人物关系/亲属关系/血亲/自然血亲/兄弟姐妹/弟弟         637\n",
       "人物关系/师生关系/学生                      547\n",
       "人物关系/亲属关系/血亲/自然血亲/兄弟姐妹/姐姐         532\n",
       "人物关系/亲属关系/血亲/自然血亲/祖父母/爷爷          291\n",
       "人物关系/亲属关系/配偶/妻子/前妻                245\n",
       "人物关系/亲属关系/配偶/丈夫/前夫                218\n",
       "人物关系/亲属关系/配偶/丈夫/未婚夫               183\n",
       "人物关系/亲属关系/姻亲/配偶的血亲/配偶的父母/岳父       165\n",
       "人物关系/亲属关系/血亲/自然血亲/侄甥/侄子           158\n",
       "人物关系/亲属关系/姻亲/血亲的配偶/子女的配偶/女婿       119\n",
       "人物关系/亲属关系/血亲/自然血亲/叔伯姑舅姨/舅舅         77\n",
       "人物关系/亲属关系/血亲/自然血亲/叔伯姑舅姨/叔伯         77\n",
       "人物关系/亲属关系/配偶/妻子/未婚妻                69\n",
       "人物关系/亲属关系/姻亲/血亲的配偶/兄弟姐妹的配偶/嫂子      67\n",
       "人物关系/亲属关系/血亲/自然血亲/孙子女/孙子           46\n",
       "人物关系/亲属关系/血亲/自然血亲/祖父母/奶奶           40\n",
       "人物关系/亲属关系/血亲/自然血亲/侄甥/侄女            30\n",
       "人物关系/亲属关系/姻亲/配偶的血亲/配偶的父母/公公        24\n",
       "人物关系/亲属关系/血亲/自然血亲/叔伯姑舅姨/姑妈         22\n",
       "人物关系/亲属关系/血亲/自然血亲/孙子女/孙女           19\n",
       "人物关系/亲属关系/姻亲/血亲的配偶/子女的配偶/儿媳        13\n",
       "人物关系/亲属关系/血亲/自然血亲/外祖父母/外公           9\n",
       "Name: r, dtype: int64"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 清除 NA 关系\n",
    "df_rm_NA = df.dropna() \n",
    "describe_df(df_rm_NA)\n",
    "# pandas统计个数 参考：https://blog.csdn.net/waple_0820/article/details/80514073\n",
    "df_rm_NA['r'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5、获得 NA 关系数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Shape/size: 248850\n",
      "df.iloc[0]:e1                           金泰熙\n",
      "e2                            金东\n",
      "r                            NaN\n",
      "s     韩国梦想演唱会第十届2004年:MC:金泰熙，金东万\n",
      "Name: 0, dtype: object\n"
     ]
    }
   ],
   "source": [
    "# 获得 NA 关系数据\n",
    "df_na = df[df['r'].isnull().values==True]\n",
    "describe_df(df_na)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 6、对 NA 关系数据进行随机抽样"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Shape/size: 24885\n",
      "df.iloc[0]:e1                                孔子\n",
      "e2                               匡亚明\n",
      "r                                NaN\n",
      "s     匡亚明和他的《孔子评传》人民日报（海外版）1985年9月8日\n",
      "Name: 169508, dtype: object\n"
     ]
    }
   ],
   "source": [
    "# 对 NA 关系数据进行 10% 随机抽样\n",
    "# 参考：python-Pandas学习 如何对数据集随机抽样？ https://blog.csdn.net/qq_22238533/article/details/71080942\n",
    "simple_rate =0.1\n",
    "df_na_sample = df_na.sample(n=int(len(df_na)*simple_rate))\n",
    "describe_df(df_na_sample)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 7、对 抽样  的Na 数据 与 非 NA 的数据合并"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Shape/size: 63386\n",
      "df.iloc[0]:e1                          窦滔\n",
      "e2                          苏蕙\n",
      "r           人物关系/亲属关系/配偶/妻子/现妻\n",
      "s     苏蕙思念不已，织绵为《回文璇玑图诗》，寄赠窦滔。\n",
      "Name: 248850, dtype: object\n"
     ]
    }
   ],
   "source": [
    "# 对 抽样  的Na 数据 与 非 NA 的数据合并\n",
    "# 合并非空的列表 test_info_vec_df_rm_NA、随机抽样后的空列表 test_info_vec_df_na_sample\n",
    "# 现将表构成list，然后在作为concat的输入\n",
    "# 参考：PANDAS 数据合并与重塑（concat篇）: https://blog.csdn.net/stevenkwong/article/details/52528616\n",
    "frames = [df_rm_NA,df_na_sample]\n",
    "new_sample_df = pd.concat(frames)\n",
    "describe_df(new_sample_df)\n",
    "#new_test_df.to_csv(\"sample50_test.txt\",index=None,header=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 8、采样数据保存"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.utils import shuffle\n",
    "new_sample_df = shuffle(new_sample_df)\n",
    "\n",
    "new_sample_df.to_csv(\"sample{0}_{1}_sent.txt\".format(str(int(simple_rate*100)),type),index=None,header=None,sep='\\t')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
